Rework gRPC status based on new rules #1308

Merged
merged 3 commits into from
Nov 2, 2020

Conversation

Contributor

@alertedsnake alertedsnake commented Oct 30, 2020

Description

As of #1214, the status codes changed and no longer line up with gRPC status codes, so now we'll just set StatusCode.ERROR and store the actual gRPC status code in the trace as grpc.status_code.
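To illustrate why the old one-to-one mapping stopped working, here is a minimal sketch using stand-in enums rather than the real `grpc` and `opentelemetry` packages. `OtelStatusCode`, `GrpcCode`, `old_mapping` and `new_mapping` are hypothetical names, and the three-member status enum is an assumption about the post-#1214 API that is not shown in this thread. The attribute key is the `rpc.grpc.status_code` name settled on later in the review.

```python
from enum import Enum


class OtelStatusCode(Enum):
    """Stand-in for the OpenTelemetry status enum (assumed to have three members)."""
    UNSET = 0
    OK = 1
    ERROR = 2


class GrpcCode(Enum):
    """Stand-in for a few grpc.StatusCode members: (number, description)."""
    OK = (0, "ok")
    CANCELLED = (1, "cancelled")
    NOT_FOUND = (5, "not found")


def old_mapping(code):
    # Pre-fix behaviour: reuse the gRPC number as an OpenTelemetry status code.
    return OtelStatusCode(code.value[0])


def new_mapping(code):
    # Post-fix behaviour for failures: a fixed ERROR status, with the gRPC
    # code carried separately as a span attribute.
    return OtelStatusCode.ERROR, {"rpc.grpc.status_code": code.name}


# gRPC CANCELLED (1) would be mislabelled as OK under the old mapping.
mislabelled = old_mapping(GrpcCode.CANCELLED)

# gRPC NOT_FOUND (5) has no counterpart in the three-member enum at all.
try:
    old_mapping(GrpcCode.NOT_FOUND)
    out_of_range = False
except ValueError:
    out_of_range = True

status, attributes = new_mapping(GrpcCode.NOT_FOUND)
```

Under these assumptions, the old mapping either picks the wrong status or raises, while the new one always reports an error and keeps the original gRPC code available to trace viewers.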

Type of change

Please delete options that are not relevant.

  • [x] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Mostly observation in Jaeger.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@alertedsnake alertedsnake requested review from a team, codeboten and lzchen and removed request for a team October 30, 2020 02:12
@alertedsnake
Contributor Author

@codeboten this should take care of your concern mentioned on #1171

Member

@toumorokoshi toumorokoshi left a comment

Thanks! I can be overridden here, but I think a unit test at minimum is warranted.

            self._active_span.set_status(
                Status(status_code=StatusCode(code.value[0]), description=details)
Member

Should there be some unit tests here? I'm a little surprised to see that there's no failing test for a significant change like making the code static.

In fact, it's a bit of a surprise that this wasn't caught when there's a range of invalid codes coming from the proto since the change. But either way, a unit test seems warranted.

Contributor Author

It wasn't caught because there aren't any real tests for error conditions in this instrumentation.

I wouldn't be opposed to working on that part also, but being thorough is going to be a bunch of work, and it seemed prudent to add a quick fix since the previous PR now violates the recently-changed spec.

@@ -125,18 +126,16 @@ def set_code(self, code):
        self.code = code
        # use details if we already have it, otherwise the status description
        details = self.details or code.value[1]
        self._active_span.set_attribute("rpc.status_code", code.name)
Member

is this a semantic convention from the spec? I don't see an entry there

Member

By the looks of it, the value will probably be rpc.grpc.status_code

Contributor Author

@alertedsnake alertedsnake Oct 30, 2020

I don't either, but then, the spec you linked is now out of date, specifically this part:

Implementations MUST set status which MUST be the same as the gRPC client/server status. The mapping between gRPC canonical codes and OpenTelemetry status codes is 1:1 as OpenTelemetry canonical codes is just a snapshot of grpc codes which can be found here.

If there's a more appropriate way to provide this data in the trace, I'd be happy to do so.

Contributor Author

Ah, see open-telemetry/opentelemetry-specification#1044 - I didn't realize this discussion was going on when I made this change.

I'd be happy to remove the extra attribute until that's resolved, but not having the status code in a trace makes the trace significantly less useful for real use.

Contributor Author

Ok I complied with your change to that spec :)

Member

Cool, I was about to ask this 😄

Contributor

@NathanielRN NathanielRN left a comment

Thanks for this quick change!

@@ -189,6 +188,7 @@ def _start_span(self, handler_call_details, context):
        attributes = {
            "rpc.method": handler_call_details.method,
            "rpc.system": "grpc",
            "rpc.grpc.status_code": grpc.StatusCode.OK,
Contributor

Just curious as to why we set grpc.StatusCode.OK before the call has completed? Is the expectation that we assume it is good and if it does fail later then this status code will be replaced?

Contributor Author

If you don't call abort() or otherwise set the status code, then it's not an error, and so OK is the default response.
I mean, as far as this interceptor can tell - we can't actually hook this at the true end of the call stack because interceptors don't really work that way.
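The default-then-override behaviour described here can be sketched with plain stand-ins. `FakeSpan`, `TracingContext`, `AbortedCall` and `run_handler` are hypothetical names for illustration only, and the sketch stores code names as strings instead of using the real `grpc.StatusCode` enum.

```python
class FakeSpan:
    """Hypothetical span stand-in that only stores attributes."""

    def __init__(self, attributes):
        self.attributes = dict(attributes)

    def set_attribute(self, key, value):
        self.attributes[key] = value


class AbortedCall(Exception):
    """Raised by the stand-in context to end the handler early."""


class TracingContext:
    """Hypothetical wrapper that records the code when the handler aborts."""

    def __init__(self, span):
        self._active_span = span

    def abort(self, code_name, details):
        # Overwrite the optimistic default set when the span was started.
        self._active_span.set_attribute("rpc.grpc.status_code", code_name)
        raise AbortedCall(details)


def run_handler(handler):
    # The span starts out assuming success, because the interceptor cannot
    # observe the true end of the call.
    span = FakeSpan({"rpc.system": "grpc", "rpc.grpc.status_code": "OK"})
    try:
        handler(TracingContext(span))
    except AbortedCall:
        pass
    return span.attributes["rpc.grpc.status_code"]


ok_result = run_handler(lambda context: None)
failed_result = run_handler(
    lambda context: context.abort("NOT_FOUND", "no such user")
)
```

A handler that never touches the status leaves the default in place; one that aborts replaces it before the span is finished.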

-                    status_code=StatusCode(self.code.value[0]),
-                    description=details,
-                )
+                Status(status_code=StatusCode.ERROR, description=details)
Contributor

I can see that everywhere we use set_details it's to add an error message so this change makes sense.

Although I feel like a good change (may or may not be for this PR) would be to have the method renamed to set_error_details?

Contributor Author

This class is a subclass of grpc.ServicerContext and so we should probably stay compliant with that API.

Member

It's subclassing servicer context so can't be changed https://grpc.github.io/grpc/python/grpc.html#grpc.ServicerContext.set_details
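A rough sketch of why the rename is not an option, assuming the base class declares `set_details` as an abstract method (my understanding of `grpc.ServicerContext`, but not something shown in this thread). `ServicerContextLike`, `TracingContext`, `RenamedContext` and `handler` are hypothetical stand-ins.

```python
import abc


class ServicerContextLike(abc.ABC):
    """Hypothetical stand-in for an abstract servicer-context base class."""

    @abc.abstractmethod
    def set_details(self, details):
        raise NotImplementedError


class TracingContext(ServicerContextLike):
    """Keeps the inherited method name, so existing handlers keep working."""

    def __init__(self):
        self.details = None

    def set_details(self, details):
        self.details = details


class RenamedContext(ServicerContextLike):
    """Renames the method and therefore never implements set_details."""

    def set_error_details(self, details):
        self.details = details


def handler(context):
    # Service code is written against the base-class API.
    context.set_details("user not found")


context = TracingContext()
handler(context)

try:
    RenamedContext()
    renamed_instantiable = True
except TypeError:
    renamed_instantiable = False
```

Handlers call `set_details` by that name, so a wrapper that renames it would either fail to instantiate or break every existing service.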

@@ -113,8 +113,9 @@ def set_trailing_metadata(self, *args, **kwargs):
    def abort(self, code, details):
        self.code = code
        self.details = details
        self._active_span.set_attribute("rpc.grpc.status_code", code.name)
Contributor

I think this is a good change (i.e. OK instead of just 0), but I'm wondering if we should record the number code.value[0] as well in another attribute. I'm okay with not recording it, just noting that it was recorded before.

Contributor Author

Looks like there's still a debate as to whether this should be the numeric code or the text version; the lean is toward numeric because of language differences. I implemented this before the last comment there, so I was going to hold off until that PR is accepted.
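For reference, the two candidate representations come from the same enum member. The sketch below uses a hypothetical `GrpcCode` stand-in that mirrors the `(number, description)` value layout implied by `code.value[0]` and `code.value[1]` in the diff; `as_text` and `as_number` are illustrative helpers, not part of this PR.

```python
from enum import Enum


class GrpcCode(Enum):
    """Stand-in for a few grpc.StatusCode members: (number, description)."""
    OK = (0, "ok")
    NOT_FOUND = (5, "not found")
    UNAVAILABLE = (14, "unavailable")


def as_text(code):
    # What the PR currently records: the enum member name.
    return code.name


def as_number(code):
    # The numeric alternative under discussion: identical across languages.
    return code.value[0]


text_values = [as_text(code) for code in GrpcCode]
numeric_values = [as_number(code) for code in GrpcCode]
```

Switching between the two would be a one-line change at each `set_attribute` call, which is why waiting for the spec decision costs little.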

Member

Is this attribute being added to the spec?

Member

@toumorokoshi toumorokoshi left a comment

whelp, I forgot to submit pending review changes. not needed now, but I'll post them anyway.

Member

@toumorokoshi toumorokoshi left a comment

Great! Thanks for addressing changes.

srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this pull request Nov 1, 2020
Contributor

@codeboten codeboten left a comment

Thanks for addressing this!


6 participants